{
 "metadata": {
  "name": "",
  "signature": "sha256:b1ad471bea57a68dbfc7493759de8e34f6f02d41da65895cb604e6d86aba73fe"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Restricted Boltzmann Machines & Deep Belief Networks"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "by Khaled Nasr as a part of a <a href=\"https://www.google-melange.com/gsoc/project/details/google/gsoc2014/khalednasr92/5657382461898752\">GSoC 2014 project</a> mentored by Theofanis Karaletsos and Sergey Lisitsyn "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this notebook we'll take a look at training and evaluating [restricted Boltzmann machines](http://en.wikipedia.org/wiki/Restricted_Boltzmann_machine) and [deep belief networks](http://en.wikipedia.org/wiki/Deep_belief_network) in Shogun."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Introduction"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Restricted Boltzmann Machines"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "An [RBM](http://deeplearning.net/tutorial/rbm.html) is an **energy based probabilistic model**. It consists of two groups of variables: the **visible variables** $ v $ and the **hidden variables** $ h $. The key assumption that RBMs make is that the hidden units are conditionally independent given the visible units, and vice versa.\n",
      "\n",
      "The RBM defines its distribution through an **energy function** $E(v,h)$, which is a function that assigns a number (called energy) to each possible state of the visible and hidden variables. The probability distribution is defined as:\n",
      "\n",
      "$$ P(v,h) := \\frac{\\exp(-E(v,h))}{Z} , \\qquad Z = \\sum_{v,h} \\exp(-E(v,h))$$\n",
      "\n",
      "where $Z$ is a constant that makes sure that the distribution sums/integrates to $1$. This distribution is also called a Gibbs distribution and $Z$ is sometimes called the **partition function**.\n",
      "\n",
      "From the definition of $P(v,h)$ we can see that the probability of a configuration increases as its energy decreases. Training an RBM in an unsupervised manner involves manipulating the energy function so that it would assign low energy (and therefore high probability) to values of $v$ that are similar to the training data, and high energy to values that are far from the training data.\n",
      "\n",
      "For an RBM with binary visible and hidden variables the energy function is defined as:\n",
      "\n",
      "$ E(v,h) = -\\sum_i \\sum_j h_i W_{ij} v_j - \\sum_i h_i c_i - \\sum_j v_j b_j $\n",
      "\n",
      "where $b \\in \\mathbb{R^n} $ is the bias for the visible units, $c \\in \\mathbb{R^m}$ is the bias for hidden units and $ W \\in \\mathbb{R^{mxn}}$ is the weight matrix between the hidden units and the visible units.\n",
      "\n",
      "Plugging that definition into the definition of the probability distribution will yield the following conditional distributions for each of the hidden and visible variables:\n",
      "\n",
      "$$ P(h=1|v) = \\frac{1}{1+exp(-Wv-c)}, \\quad P(v=1|h) = \\frac{1}{1+exp(-W^T h-b)} $$\n",
      "\n",
      "We can do a quick visualization of an RBM:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%pylab inline\n",
      "%matplotlib inline\n",
      "import os\nSHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')\n",
      "import networkx as nx\n",
      "  \n",
      "G = nx.Graph()\n",
      "pos = {}\n",
      "\n",
      "for i in range(8):\n",
      "    pos['V'+str(i)] = (i,0)\n",
      "    pos['H'+str(i)] = (i,1)\n",
      "    \n",
      "    for j in range(8): G.add_edge('V'+str(j),'H'+str(i))\n",
      "\n",
      "figure(figsize=(7,2))\n",
      "nx.draw(G, pos, node_color='y', node_size=750)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The nodes labeled V are the visible units, the ones labeled H are the hidden units. There's an indirected connection between each hidden unit and all the visible units and similarly for visible unit. There are no connections among the visible units or among the hidden units, which implies the the hidden units are independent of each other given the visible units, and vice versa."
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Deep Belief Networks"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If an RBM is properly trained, the hidden units learn to extract useful features from training data. An obvious way to go further would be transform the training data using the trained RBM, and train yet another RBM on the transformed data. The second RBM will learn to extract useful features from the features that the first RBM extracts. The process can be repeated to add a third RBM, and so.\n",
      "\n",
      "When stacked on top of each other, those RBMs form a Deep Belief Network [1]. The network has directed connections going from the units in each layer to units in the layer below it. The connections between the top layer and the layer below it are undirected. The process of stacking RBMs to form a DBN is called pre-training the DBN.\n",
      "\n",
      "After pre-training, the DBN can be used to initialize a similarly structured neural network which can be used for supervised classification.\n",
      "\n",
      "We can do a visualization of a 4-layer DBN:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "G = nx.DiGraph()\n",
      "pos = {}\n",
      "\n",
      "for i in range(8):\n",
      "    pos['V'+str(i)] = (i,0)\n",
      "    pos['H'+str(i)] = (i,1)\n",
      "    pos['P'+str(i)] = (i,2)\n",
      "    pos['Q'+str(i)] = (i,3)\n",
      "    \n",
      "    for j in range(8): \n",
      "        G.add_edge('H'+str(j),'V'+str(i))\n",
      "        G.add_edge('P'+str(j),'H'+str(i))\n",
      "        G.add_edge('Q'+str(j),'P'+str(i))\n",
      "        G.add_edge('P'+str(j),'Q'+str(i))\n",
      "\n",
      "figure(figsize=(5,5))\n",
      "nx.draw(G, pos, node_color='y', node_size=750)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "RBMs in Shogun"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "RBMs in Shogun are handled through the [CRBM](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CRBM.html) class. We create one by specifying the number of visible units and their type (binary, Gaussian, and Softmax visible units are supported), and the number of hidden units (only binary hidden units are supported).\n",
      "\n",
      "In this notebook we'll train a few RBMs on the USPS dataset for handwritten digits. We'll have one RBM for each digit class, making 10 RBMs in total:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from shogun import RBM, RBMVUT_BINARY, Math\n",
      "\n",
      "# initialize the random number generator with a fixed seed, for repeatability\n",
      "Math.init_random(10)\n",
      "\n",
      "rbms = []\n",
      "for i in range(10):\n",
      "    rbms.append(RBM(25, 256, RBMVUT_BINARY)) # 25 hidden units, 256 visible units (one for each pixel in a 16x16 binary image)\n",
      "    rbms[i].initialize_neural_network()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next we'll load the USPS dataset:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from scipy.io import loadmat\n",
      "\n",
      "dataset = loadmat(os.path.join(SHOGUN_DATA_DIR, 'multiclass/usps.mat'))\n",
      "\n",
      "Xall = dataset['data']\n",
      "# the usps dataset has the digits labeled from 1 to 10 \n",
      "# we'll subtract 1 to make them in the 0-9 range instead\n",
      "Yall = np.array(dataset['label'].squeeze(), dtype=np.double)-1 "
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we'll move on to training the RBMs using Persistent Contrastive Divergence [2]. Training using regular Contrastive Divergence [3] is also [supported](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CRBM.html#abf0cdced1524987a687535a2a8de5cb3).The optimization is performed using [Gradient Descent](http://en.wikipedia.org/wiki/Gradient_descent). The training progress can be monitored using the reconstruction error or the [psuedo-likelihood](http://en.wikipedia.org/wiki/Pseudolikelihood). Check the [public attributes](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CRBM.html#pub-attribs) of the CRBM class for all the available training options. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from shogun import RealFeatures, RBMMM_PSEUDO_LIKELIHOOD\n",
      "\n",
      "# uncomment this line to allow the training progress to be printed on the console\n",
      "#from shogun import MSG_INFO; rbms[0].io.set_loglevel(MSG_INFO)\n",
      "\n",
      "for i in range(10):\n",
      "    # obtain the data for digit i\n",
      "    X_i = Xall[:,Yall==i]\n",
      "    \n",
      "    # binarize the data for use with the RBM\n",
      "    X_i = (X_i>0).astype(float64)\n",
      "    \n",
      "    # set the number of contrastive divergence steps\n",
      "    rbms[i].cd_num_steps = 5\n",
      "    \n",
      "    # set the gradient descent parameters\n",
      "    rbms[i].gd_learning_rate = 0.005\n",
      "    rbms[i].gd_mini_batch_size = 100\n",
      "    rbms[i].max_num_epochs = 30\n",
      "    \n",
      "    # set the monitoring method to pseudo-likelihood\n",
      "    rbms[i].monitoring_method = RBMMM_PSEUDO_LIKELIHOOD\n",
      "    \n",
      "    # start training\n",
      "    rbms[i].train(RealFeatures(X_i))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "After training, we can draw samples from the RBMs to see what they've learned. Samples are drawn using [Gibbs sampling](http://en.wikipedia.org/wiki/Gibbs_sampling). We'll draw 10 samples from each RBM and plot them:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "samples = zeros((256,100))\n",
      "for i in range(10):\n",
      "    # initialize the sampling chain with a random state for the visible units\n",
      "    rbms[i].reset_chain()\n",
      "    \n",
      "    # run 10 chains for a 1000 steps to obtain the samples\n",
      "    samples[:,i*10:i*10+10] = rbms[i].sample_group(0, 1000, 10)\n",
      "\n",
      "# plot the samples\n",
      "figure(figsize=(7,7))\n",
      "for i in range(100):\n",
      "\tax=subplot(10,10,i+1)\n",
      "\tax.imshow(samples[:,i].reshape((16,16)), interpolation='nearest', cmap = cm.Greys_r)\n",
      "\tax.set_xticks([])\n",
      "\tax.set_yticks([])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "DBNs in Shogun"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we'll create a DBN, pre-train it on the digits dataset, and use it initialize a neural network which we can use for classification.\n",
      "\n",
      "DBNs are handled in Shogun through the [CDeepBeliefNetwork](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CDeepBeliefNetwork.html) class. We create a network by specifying the number of visible units it has, and then add the desired number of hidden layers using [add_hidden_layer()](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CDeepBeliefNetwork.html#ae765aad072aef83a41906140c9f9a8c5). When done, we call [initialize_neural_network()](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CDeepBeliefNetwork.html#aae4e01a7cfd234a1895b40be197ef83a) to initialize the network:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from shogun import DeepBeliefNetwork\n",
      "\n",
      "dbn = DeepBeliefNetwork(256) # 256 visible units\n",
      "dbn.add_hidden_layer(200) # 200 units in the first hidden layer\n",
      "dbn.add_hidden_layer(300) # 300 units in the second hidden layer\n",
      "\n",
      "dbn.initialize_neural_network()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Then we'll pre-train the DBN on the USPS dataset. Since we have 3 layers, the DBN will be pre-trained as two RBMs: one that consists of the first hidden layer and the visible layer, the other consists of the first hidden layer and the second hidden layer. Pre-training parameters can be specified using the [pt_* public attributes](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CDeepBeliefNetwork.html#a0753163b619371f111d55ade639752c4) of the class. Each of those attributes is an [SGVector](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1SGVector.html) whose length is the number of RBMs (2 in our case). It can be used to set the parameters for each RBM indiviually. [SGVector's set_const()](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1SGVector.html#a8bce01a1fc41a734d9b5cf1533fd7a2a) method can also be used to assign the same parameter value for all RBMs."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# take 3000 examples for training, the rest for testing\n",
      "Xtrain = Xall[:,0:3000]\n",
      "Ytrain = Yall[0:3000]\n",
      "Xtest = Xall[:,3000:-1]\n",
      "Ytest = Yall[3000:-1]\n",
      "\n",
      "# set the number of contrastive divergence steps\n",
      "dbn.pt_cd_num_steps.set_const(5)\n",
      "\n",
      "# set the gradient descent parameters\n",
      "dbn.pt_gd_learning_rate.set_const(0.01)\n",
      "dbn.pt_gd_mini_batch_size.set_const(100)\n",
      "dbn.pt_max_num_epochs.set_const(30)\n",
      "\n",
      "# binarize the data and start pre-training\n",
      "dbn.pre_train(RealFeatures((Xtrain>0).astype(float64)))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "After pre-training, we can visualize the features learned by the first hidden layer by plotting the weights between some hidden units and the visible units:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# obtain the weights of the first hidden layer\n",
      "w1 = dbn.get_weights(0)\n",
      "\n",
      "# plot the weights between the first 100 units in the hidden layer and the visible units\n",
      "figure(figsize=(7,7))\n",
      "for i in range(100):\n",
      "\tax1=subplot(10,10,i+1)\n",
      "\tax1.imshow(w1[i,:].reshape((16,16)), interpolation='nearest', cmap = cm.Greys_r)\n",
      "\tax1.set_xticks([])\n",
      "\tax1.set_yticks([])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now, we'll use the DBN to initialize a [CNeuralNetwork](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CNeuralNetwork.html). This is done through the [convert_to_neural_network()](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CDeepBeliefNetwork.html#a8c179cd9a503b2fa78b9bfe10ae473e5) method. The neural network will consist of a [CNeuralInputLayer](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CNeuralInputLayer.html) with 256 neurons, a [CNeuralLogisticLayer](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CNeuralLogisticLayer.html) with 200 neurons, and another [CNeuralLogisticLayer](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CNeuralLogisticLayer.html) with 300 neurons. We'll also add a [CNeuralSoftmaxLayer](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CNeuralSoftmaxLayer.html) as an output layer so that we can train the network in a supervised manner. We'll also train the network on the training set:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from shogun import NeuralSoftmaxLayer, MulticlassLabels\n",
      "\n",
      "# get the neural network\n",
      "nn = dbn.convert_to_neural_network(NeuralSoftmaxLayer(10))\n",
      "\n",
      "# add some L2 regularization\n",
      "nn.l2_coefficient = 0.0001\n",
      "\n",
      "# start training\n",
      "nn.set_labels(MulticlassLabels(Ytrain))\n",
      "_ = nn.train(RealFeatures(Xtrain))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "And finally we'll measure the classification accuracy on the test set:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from shogun import MulticlassAccuracy\n",
      "\n",
      "predictions = nn.apply_multiclass(RealFeatures(Xtest))\n",
      "accuracy = MulticlassAccuracy().evaluate(predictions, MulticlassLabels(Ytest)) * 100\n",
      "\n",
      "print(\"Classification accuracy on the test set =\", accuracy, \"%\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "References"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "- [1] [A Fast Learning Algorithm for Deep Belief Nets, Hinton, 2006](http://www.cs.toronto.edu/~fritz/absps/ncfast.pdf)\n",
      "- [2] [Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient, Tieleman, 2008](http://www.machinelearning.org/archive/icml2008/papers/638.pdf)\n",
      "- [3] [Training Products of Experts by Minimizing Contrastive Divergence, Hinton, 2002](http://learning.cs.toronto.edu/~hinton/csc2535/readings/nccd.pdf)"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}
